Prosper Loan Data by Marie Leaf

Introduction

An exploratory data analysis (EDA) is performed on the Prosper lending data. I am curious to learn more about the dynamics of P2P loans, as they emerge as a financial product in the American economy, and perhaps how they may develop as instruments supporting the Title III of the Jobs Act (which permits companies to offer and sell securities through crowdfunding). I am ultimately curious about how distributed systems (built with blockchain) may develop a (perhaps autonomous with Ethereum) P2P lending platform…but this is beyond the scope of this EDA.

While I use various techniques to scan over a swath of different variables, I focus on loans’ listing categories, status, estimated return to lenders, and the Prosper rating, which provides a measure of loan risk.

I think loan status is an interesting variable to understand to potentially build something like a lender dashboard, to see loans outstanding and segment their outstanding loans by risk score, to better know which loans to focus on.

I think listing categories are an interesting variable to explore as loan use is a very important factor in understanding the sustainability and dynamics of the marketplace - are these loans that foster revenue generation for it’s borrowers, are they consumer product loans? Will these loans be affected by Title III of the jobs act? Should Prosper be focused on growing a certain sector of loans?

## [1] "/Users/marieleaf/CODE/datanano/P4_explore_analyze"

Clean up the data a bit

Factor key ordered factor variables that exist as integers (Term, ListingCategory)

added label names to listing categories to use as a character variable

Univariate Plots Section

Plot 1 - Loan origination date

We see a huge dip in the amount of loans in 2008-2009, with the number peaking in 2013-2014 to above the 30k count. This dip in 2008 can be attributed to the regulatory issues that Prosper needed to address.

Looking at the monthly trends, we see loans peaking above 90k in October, December, and January and dipping to ~75K in April. I would have imagined more variability between the months, but perhaps variations come with Loan Category, State, or other categorical variables.

Plot 2 - Variable distributions with Count Histograms

To first explore the distributions of variables of interest, I’ll draw up quick histograms and box plots.

In the Listing Category histogram, we see a very dispersed count among the different categories. This is important to keep note of for analyzing the categories in later plots. The largest category of loans is #1, or ‘Debt Consolidation’

In the Loan Original Amount histogram, we can see that the majority of loans are below the 10000k threshold. An quick judgement would suggest that the typical profile of a Prosper loan is a medium risk, smaller <10k, debt consolidation loans, with monthly payments below a thousand dollars.

Besides ‘home improvement’ most of the loans are categorized in either arbitrary and/or opaque categories (debt consolidation, other and NA) As a lender on this platform, or for better system intelligence, it would be beneficial if Prosper was more rigorous in requiring lendees to categorize loans, or qualify the debt consolidation loans. This would increase transparency and give the marketplace more accuracy in it’s predictions and qualitative data to understand dynamics of the loans. I looked into the dynamics of debt consolidation loans and found a explanatory blog post that a borrower wrote, which details how helpful the user found Prosper to be in refinancing his credit card debt, an industry I do find especially predatory. Here is another beneficial post detailing how users benefit from P2P lending platforms.

Interesting to note here that the majority of loans have 36 month terms. The percentage of 36 month loans have a spike of ~$175 Monthly loan payments, and 28/29% Effective Yields. Now how to further segment those estimated effective yields?

Box Plots for categorical data

Looking at the means of Effective Yield by ProsperRating, we see a likely negative correlation of the mean Yield by Prosper Rating.

The highest mean of the effective yields by category (excluding loans without a listed category) is found in loans that are for cosmetic procedures with a mean yield of 0.2052 (and a higher median of 0.2254), followed closely by household expenses with a mean of 0.2015.

Interesting to note here that cosmetic loans have a relatively conservative spread in their estimated returns, with very few outliers, and overall positive returns. The spread of their estimated yield is comparatively similar with their IQR being between 14.7% and 25.9%. The largest IQR range seems to be with ‘green loans’, between 11.3% and 26.4%. This is likely due to the variable nature of ‘green loans’ and it would be interesting to look further into the actual use of these loans.

Student loans have the lowest estimated effective yields (mean of 10.3% and a median of 10.2%), as yields are prospective returns, this is concurrent with the news about student loan debt in the United States

For reference, here is the definitive difference between yield and return: While both terms are often used to describe the performance of an investment, they’re not the same thing. Knowing what each measure takes into account and recognizing that each considers different time periods is key.

Return, also referred to as “total return”, expresses what an investor has actually earned on an investment during a certain time period in the past. It includes interest, dividends and capital gain (such as an increase in the share price). In other words, return is retrospective, or backward-looking. It describes what an investment has concretely earned.

Yield, on the other hand, is prospective, or forward-looking. Furthermore, it measures the income, such as interest and dividends, that an investment earns and ignores capital gains. This income is taken in the context of a certain time period and then annualized, with the assumption that the interest or dividends will continue to be received at the same rate. Yield is often used to measure bond or debt performance; in most cases, total return will not be the same as the quoted yield due to fluctuations in price. # Univariate Analysis

What is the structure of your dataset?

## [1] 113937     37
## 'data.frame':    113937 obs. of  37 variables:
##  $ ListingNumber            : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ Term                     : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ EstimatedReturn          : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ListingCategory..numeric.: int  0 2 0 16 2 1 1 2 7 7 ...
##  $ ProsperRating..numeric.  : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ MonthlyLoanPayment       : num  330 319 123 321 564 ...
##  $ ProsperScore             : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ EstimatedEffectiveYield  : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ BankcardUtilization      : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ TotalProsperLoans        : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ AmountDelinquent         : num  472 0 NA 10056 0 ...
##  $ StatedMonthlyIncome      : num  3083 6125 2083 2875 9583 ...
##  $ DebtToIncomeRatio        : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ EmploymentStatusDuration : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ CreditScoreRangeLower    : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper    : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ CurrentCreditLines       : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ BorrowerState            : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation               : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus         : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ IsBorrowerHomeowner      : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ IncomeRange              : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable         : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ BorrowerAPR              : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate             : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ ProsperPrincipalBorrowed : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ LoanStatus               : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ LoanOriginalAmount       : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate      : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ listingCatAlpha          : chr  "Not Available" "Home Improvement" "Not Available" "Motorcycle" ...
##  $ TermFactored             : Factor w/ 3 levels "12","36","60": 2 2 2 2 2 3 2 2 2 2 ...
##  $ ListingCatFactored       : Factor w/ 21 levels "0","1","2","3",..: 1 3 1 17 3 2 2 3 8 8 ...
##  $ ListingCategoryAggr      : chr  "Not Available" "Home Improvement" "Not Available" "Vehicle" ...
##  $ LoanOrigDate             : POSIXct, format: "2007-09-12" "2014-03-03" ...
##  $ LoanOrigMonth            : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 9 3 1 11 9 12 4 5 12 12 ...
##  $ LoanOrigYear             : Factor w/ 10 levels "2005","2006",..: 3 10 3 8 9 9 9 9 9 9 ...
##  $ LoanOrigMonthYear        : Factor w/ 94 levels "Nov 2005","Dec 2005",..: 23 94 15 78 88 91 83 84 91 91 ...

What is/are the main feature(s) of interest in your dataset?

Estimated Return, EstimatedEffectiveYield, Listing Category, ProsperRating, Term, LoanOrigDate

LoanOriginationAmount (however this feature only became a feature of interest after running a correlation heatmap in my multivariate plot section)

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

From a lender’s perspective: - MonthlyLoanPayments, InterestRate

From the company’s perspective: - LoanStatus - IncomeRange

From a borrower’s perspective: - BorrowerState, Occupation, IncomeRange, DebtToIncomeRatio

Did you create any new variables from existing variables in the dataset?

Yes, I parsed out month and year from the loan origination date variable. Pulled out excess returns by listing category (see code later on in multivariate plots section)

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes, see previous comments/notes.

Bivariate/Multivariate Plots Section

Step 1.1 - Run a correlation matrix heatmap

After running a correlation matrix heatmap, you can visualize the strength of positive and negative correlations between numerical variables.

It is interesting to note a strong negative correlation between BorrowerRate and LoanOriginationAmount, and a positive correlation between MonthlyLoanPayment and LoanOriginationAmount. Does this make sense? Yes, but dissapointing that there arent stronger correlative patterns from ProsperRating and zero correlations with ProsperScore.

The variables with the most information gain, I would deem to be BorrowerRate, and LoanOriginalAmount as they both have the most occurences and mix of both negative and positive correlations with other variables. However, this matrix seems very limited, and my syntax doesn’t handle pairwise completion for NA variables.

Perhaps I need to tune parameters of the heat map to see correlations from a different perspective.

Step 1.2 - Reshape correlation matrix

## 
##  Pearson's product-moment correlation
## 
## data:  loans$ProsperScore and loans$EstimatedEffectiveYield
## t = -237.56, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6360272 -0.6279452
## sample estimates:
##        cor 
## -0.6320034

Ah! That is better, now I can see the strength of correlations between ProsperScore/ProsperRating and EstimatedEffectiveYield/BorrowerAPR/BorrowerRate are more pronounced. This makes more sense, a borrower with a lower rating gets a higher rate.

When I run a correlation matrix that is able to cluster order the correlations, I can detect patterns with more clarity. With my second heatmap correlation matrix we still see the correlation variance for LoanOriginalAmount with negative correlations of -0.33, -0.32, -0.33 with EstimatedEffectiveYield, BorrowerAPR, BorrowerRate and 0.34, 0.43, 0.27 with CreditScore, ProsperRating, and ProsperScore respectively. The negative correlations would indicate to me lower loan amounts get lower returns/lower rates because of less risk exposure.

It’s interesting to note that there is a relatively high correlation between ListingNumber (a seemingly arbitrary key number) and estimated return. Could this be related to Benford’s first digit law? I’m not particularly interested in pursuing this further, but just a thought!

The positive correlations would indicate to me a possible cap on the LoanOriginalAmount and Prosper Risk scores (credit, rating, and score) - the temporal nature of these loans on risk exposure is a category I’d like to explore further with a temporal/categorical crosssection.

Step 2 - Drill deeper into correlated variables with scatterplots

## 
##  Pearson's product-moment correlation
## 
## data:  loans$LoanOriginalAmount and loans$EstimatedEffectiveYield
## t = -100.89, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3332603 -0.3212446
## sample estimates:
##        cor 
## -0.3272657
## 
##  Pearson's product-moment correlation
## 
## data:  loans$ProsperScore and loans$EstimatedEffectiveYield
## t = -237.56, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6360272 -0.6279452
## sample estimates:
##        cor 
## -0.6320034
## 
##  Pearson's product-moment correlation
## 
## data:  loans$LoanOriginalAmount and loans$EstimatedReturn
## t = -86.98, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2922833 -0.2799279
## sample estimates:
##        cor 
## -0.2861175
## 
##  Pearson's product-moment correlation
## 
## data:  loans$ProsperScore and loans$EstimatedReturn
## t = -120.85, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3889362 -0.3774554
## sample estimates:
##        cor 
## -0.3832106

Took out the NA variables from the data to get a clearer picture of all the different categories. The first Estimated Effective Yield and Loan Original Amount scatterplot, shows the -0.33 correlation. The Estimated Effective Yield and Prosper Score scatterplot, shows the -0.63 correlation. The Estimated Return and Loan Original Amount scatterplot, shows the -0.29 correlation. The Estimated Return and Prosper Score scatterplot, shows the -0.38 correlation.

There seems to be smaller inner quartile ranges (IQRs) for estimated returns by Prosper Score than IQRs for Estimated Effective Yield (a cursory judgement of IQRs was enabled by using an alpha of 0.20 to indicate density of values), even though the correlation is stronger, as indicated by a steeper negative slope.

Step 3 - Categorical temporal histogram

From this graph we can see that there is cyclical variation of all categories of loans EXCEPT for Auto, Baby, Boat, and Business loans. However, the representation may be more pronounced as there are fewer cases of business and auto loans to visually see the variation.

When the Y axis (count) is transformed by squaring, there still doesn’t seem to be much cyclical variation for Auto, Baby, Boat, and Business loans.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

It was interesting to note the information gain from LoanOriginalAmount.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)? What was the strongest relationship you found?

Please see notes above under plots.

I came to this section with thoughts about which variables that may have interesting relationships: RevolvingCreditBalance, IncomeRange, BankcardUtilization I.e. would a low bankcard utilization indicate low trust in banks? I’m curious about whether people’s relationships with their banks affect their prospensity to use P2P lending. And would this be correlated with IncomeRanges?

From the correlation heatmap matrix, I could quickly see that there wasnt a correlation between StatedMonthlyIncome and BankcardUtilization.

I also did not find strong correlations between DebToIncomeRatio and CurrentDelinquencies, two variables I would expect to be closely linked.

Multivariate Plots Section

mean/median prosperscore by listing category

When we aggregate the category variables we can see the categories a little more clearly. The huge spread of NA loans makes the case for loans to be categorized as earlier stated in my initial analysis of the opacity of the loans.

The huge variation in green loans from 2010-2012 is interesting to note. These are loans for home improvement and green energy financing. Dipping below 4% and peaking above 2% three different times. This huge fluctuation is concurrent with the whole renewable energy market volatility.

loan original amount # Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In the loan category excess returns by month, there seems to be quite a lot of volatility across the categories as a whole, but perhaps need to filter by category to see interesting patterns. Please see Plot 2 of Final Plots and Summary for further analysis.

Were there any interesting or surprising interactions between features?

We see that the NA loans have the highest variability in returns which is understandable considering that it is probably a huge random assortment bucket of loans.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

The limitations of my model, as outlined in the final analysis section, is that the Sharpe ratio is in this case drawn endogenously, benchmarking against the average returns of the Prosper Market. For future work, I would like to benchmark against the SP500 and other indices. This sort of work would be useful in selling P2P lending as a platform to institutional lenders.

The Beta and Covariance between the listing categories would be helpful to compute the max Sharpe ratio of a portfolio. This could serve as a dashboard/widget as a helpful product the platform could provide/sell to lenders (both institutional and regular).


Final Plots and Summary

Draw comparisons. Identify trends. Engage a wide audience. Explain a complicated finding. Clarify a gap between perception and reality. Enable the reader to digest large amounts of information.

Plot One

## 
##  Pearson's product-moment correlation
## 
## data:  loans$ProsperScore and loans$EstimatedEffectiveYield
## t = -237.56, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6360272 -0.6279452
## sample estimates:
##        cor 
## -0.6320034
## 
##  Pearson's product-moment correlation
## 
## data:  loans$ProsperScore and loans$EstimatedReturn
## t = -120.85, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3889362 -0.3774554
## sample estimates:
##        cor 
## -0.3832106

Description One

The correlation matrix was able to effectively give a snapshot overview of some of the most important relationships amongst the numerical variables.

I chose to highlight the ProsperScore scatterplots measured against Estimate Effective Yield and Estimated Returns. These relationships had some of the strongest negative correlations (-0.63 and -0.38 respectively)

I am surprised at the larger spread for (future) estimated yields, rather than (retrospective) estimated returns which at first indicated to me that the Prosper Score was good at predicting estimated returns, but I realized Prosper Score may be a variable that was derived from estimated returns (and not the other way around).

Upon second look at the prosper score scatter charts, the large spread in the estimated yields chart makes sense, as a lot of these estimates are closer to 0.3, really demonstrating the fallibility of human optimism in future yields vs. the reality of returns!

Plot Two

##  listingCatAlpha     meanExcess          sdReturn        
##  Length:21          Length:21          Length:21         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##       n                sharpeR         
##  Length:21          Min.   :-0.110689  
##  Class :character   1st Qu.:-0.029824  
##  Mode  :character   Median :-0.009087  
##                     Mean   : 0.010682  
##                     3rd Qu.: 0.054386  
##                     Max.   : 0.186610  
##                     NA's   :7

Description Two

My second plots show first the excess returns be category with the mean of the endogenous Sharpe ratio sitting at 1%. The next plot shows the moving averages of the excess returns across listing categories (the Estimated Returns, grouped by listing category and effectively binned by month of the year).

We can see a huge gap in loans representative of the drop in loans from 2008-2009, but steady variation across many categories. This plot could be further refined to filter by large standard deviations.

In the Category Excess Returns plot, we can see that household, student use, wedding, medical/dental, other and loans without a category listed have the highest excess returns.

For portfolio optimization perspective a rolling Sharpe Ratios were plotted on a month by month basis for each listing category. The variance in the Student Loans brought the Sharpe Ratio down to zero, and they are deemed to not be a good asset due to risk. Cosmetic loans again seem to be a good choice of asset class, with the highest Sharpe ratios.

Also, for follow on work, we can pull historical treasure bill data and index the Sharpe ratios against the mean and std of the SP500 instead of the Prosper loans total. This will determine risk exposure indexed against the market, and extend portfolio optimization exogenously beyond just the Prosper Loan market.

Plot Three

Description Three

I wanted to drill deeper into the risk of student loans.In light of recent news about 40% of students defaulting on 200 billion of government loans (include link). I want to see if there is a visual pattern that would indicate or add any predictive value to an imbalanced situation like this.

I want to highlight my third plot, a count histogram of Student Use loans, where I discovered that these loans actually stopped in 2010. This explains the lower yields than returns for student loans noted in my initial univariate box plots Perhaps this is due to platform regulatory issues, or recording practices, or poor performance of loans.

I further chose to investigate the historical performance of the loans as a possible causal mechanism for why Student Use loans stopped in 2010. Excess returns were at an all time high right before they stopped, so we can discount poor returns as a causal mechanism for Student Use loans stopping in 2010.


Reflection

The dataset contained records of almost 114,000 loans from Nov 2005 - March 2014. Over the course of those years, Prosper made close to $1 trillion dollars in loans.

Running through the process, univariate to bivariate to multivariate analysis, I actually found it really useful to work backwards from a correlation heatmap matrix (a multivariate plot). I will definitely use this tool for future analyses where I am working with data with a lot of variables.

Where did I run into difficulties in the analysis?

I struggled in trying to cut through the noise of all initial 81 variables to select variables that contained the most ‘information’. I ran a cursory/manual parse to cut the full set down to 31 variables, and then further cut the set down to 21 numerical values to run my correlation matrix.

I struggled with my initial correlation heatmap, using a very manual tutorial that did not populate the matrix very effectively.

I also struggled in reshaping my dataframes with dplyr in a way to compute sharpe ratios and other risk factors. ie running multiple summarise() functions on different grouping levels of the data proved to be a bit more convoluted than I anticipated, but perhaps there is a shorter way (input appreciated!)

Where did I find successes?

I found eventual success in installing and running the corrplot package to construct a corrplot matrix that clustered the R values of my numerical variables.

I also found eventual success playing around with mutate() and fulljoin() functions to prep the data to compute Sharpe Ratios.

How could the analysis be enriched in future work (e.g. additional data and analyses)?

Perhaps I may be able to pull out explanatory factors from all 81 variables with a PCA classifier, or other machine learning techniques taught in Project 5 of this course.

Perhaps we can continue exploring patterns with categorical data. http://www.r-statistics.com/2010/04/correlation-scatter-plot-matrix-for-ordered-categorical-data/

As well, I’d like to run an analysis and visualization of the listing category Sharpe Ratios, and train good portfolio management classifier.

In light of recent news about 40% of students are not paying $200 billion worth of government loans, I want to see if there are more visual patterns I can explore that would indicate or add any predictive value to a tenuous situation like this or combine it with exogenous market variables to assess risk in student loan lending on Prosper. However, the Student use of Prosper loans seems to stop in 2010, so an exploration of why there are no more student use listings after 2010 is necessary.